Abstract. In the context of kernel density estimation, we give a
characterization of the kernels for which the parametric mean integrated
squared error rate $n^{-1}$ may be obtained, where $n$ is the sample
size. We also give an asymptotic bandwidth choice that makes the kernel
estimator consistent in mean integrated squared error at that rate, and a
numerical example showing the superior performance of the superkernel
estimator when the bandwidth is properly chosen.

1 Introduction



If $X_1, \ldots, X_n$ is a sample from a probability distribution on the real line with
density $f$, the kernel density estimator is given by

$$f_{n,K,h}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i),$$

where the kernel $K$ is an integrable function with $\int K = 1$, the bandwidth $h$
is a positive real number and we have used the notation $K_h(x) = K(x/h)/h$;
see, e.g., Silverman (1986), Simonoff (1996) or Wand and Jones (1995). The
$L_2$ error criterion will be used here; that is, we will measure the error of the
estimate $f_{n,K,h}$ through the mean integrated squared error (MISE), defined by

$$\mathrm{MISE}_f(f_{n,K,h}) = E_f \int [f_{n,K,h}(x) - f(x)]^2\,dx.$$

We will assume henceforth that all the kernels below are bounded functions,
continuous at zero and such that $\int K^2 < 2K(0)$. These technical conditions
ensure that an optimal bandwidth $h_{0n}(f) = \operatorname{argmin}_{h>0} \mathrm{MISE}_f(f_{n,K,h})$ exists
(see Chacón et al., 2006).
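
The definition of the kernel density estimator translates almost literally into code. The following is a minimal Python sketch, not code from the paper: the names `kde` and `gaussian_kernel`, the toy sample and the use of the standard normal density as an example kernel are all assumptions made for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here only as an example kernel (assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Evaluate f_{n,K,h}(x) = (1/n) * sum_i K_h(x - X_i), with K_h(u) = K(u/h)/h."""
    if h <= 0:
        raise ValueError("the bandwidth h must be positive")
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

sample = [-1.2, -0.3, 0.1, 0.4, 1.5]   # toy data, purely illustrative
print(kde(0.0, sample, h=0.5))
```

Any integrable function with unit integral can be passed as `kernel`, including kernels that take negative values.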

The main goal of this paper is to characterize the kernel functions that 
make the MISE converge to zero as fast as possible. Most commonly used 
kernels are the positive ones, because they produce bona fide density esti- 
mators; that is, estimators that, for every observed sample, provide a true 
density function (i.e., $f_{n,K,h} \ge 0$ and $\int f_{n,K,h} = 1$). However, it is widely
known that for positive kernels the MISE cannot decrease to zero faster than
$n^{-4/5}$ (Rosenblatt, 1956). In this sense, some benefit can be obtained if we
allow the kernel to take negative values (see Theorem 1 below), although
the price to be paid is that the resulting estimate is not a positive function.
Nevertheless, in a recent paper, Glad, Hjort and Ushakov (2003) show that,
based on a non-bona fide estimator, it is possible to construct a bona fide 
one with even smaller MISE. Thus, there is no reason, in terms of MISE, to 
avoid the use of kernels taking negative values in density estimation. 

Watson and Leadbetter (1963) showed, in a very general setting,
that the MISE of kernel density estimators cannot decrease faster than $n^{-1}$.
Davis (1977) characterized the class of densities for which this "parametric"
rate can be achieved (see Theorem 3 below). In this paper, we give a
characterization of those kernels for which the MISE of the corresponding
kernel estimator goes to zero at rate $n^{-1}$ for some density, so that together
with the result of Davis (1977) we obtain a precise description of the family of
densities and kernels for which the parametric rate is attainable (see Theorem
4). Moreover, for this family we provide practical bandwidth-choice advice for
achieving this rate.

2 Main results 

Let us recall some facts about kernels. If we denote by $m_j(K) = \int x^j K(x)\,dx$
the $j$-th moment of a kernel $K$, we say that $K$ is of finite order if the set

$$A_K = \{j \in \mathbb{N},\ j \ge 1 : m_j(K) \ne 0\}$$

is non-empty. In this case, $k = \min A_K$ is called the order of the kernel $K$. If
$A_K = \emptyset$ then $K$ is said to be a kernel of infinite order; such a kernel
satisfies $m_j(K) = 0$ for $j = 1, 2, \ldots$

An example of an infinite order kernel is Natterer's kernel, whose char- 
acteristic function is given by (f{t) = e^*^/*^^^*^)/[_ij](t), where stands for 
the indicator function of the set A (see Devroye and Lugosi, 2001, Ch. 17). If 
K is the density of a symmetric distribution with finite variance, then K is a 
kernel of order 2. A method for constructing a kernel of arbitrary finite order 
is shown in Schucany and Sommers (1977); however, if we want a kernel K 
to have order k > 2 then K must necessarily take negative values. 

Let us denote

$$\Phi(n, f, K) = \min_{h>0} \mathrm{MISE}_f(f_{n,K,h});$$

that is, $\Phi(n, f, K)$ is the minimal MISE that can be achieved when we use the
kernel $K$ and a sample of size $n$ to estimate $f$. The reason for using kernels
of order greater than 2 (non-positive, therefore) rests upon the following
theorem, which can be found, for instance, in Wand and Jones (1995).

Theorem 1. If $K \in L_2$ is a symmetric kernel of finite order $k$ and the
density $f$ has a $k$-th continuous derivative belonging to $L_2$ then the minimal
MISE that may be obtained by estimating $f$ using a kernel estimator with
kernel $K$ is of exact order $n^{-2k/(2k+1)}$; that is,

$$\lim_{n\to\infty} n^{2k/(2k+1)}\, \Phi(n, f, K) = \alpha_1,$$

where $\alpha_1 \in (0, \infty)$ is a constant depending on $f$ and $K$.

Thus, as we make the kernel order grow, the rate of convergence of the
optimal MISE to zero approaches the parametric rate $n^{-1}$, although the class
of densities for which this rate is valid gets smaller and smaller. The question
is: is there any kernel that effectively attains the rate $n^{-1}$ for some density?
The kernels that achieve that MISE-rate for some density will deserve to be
called superkernels; that is, a superkernel will be a kernel $K$ which satisfies

$$\lim_{n\to\infty} n\,\Phi(n, f, K) = \alpha_2$$

for some density $f$, with $0 < \alpha_2 < \infty$. As stated in the previous section, our
purpose here is to give a characterization of such superkernels.
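
To make concrete how the finite-order rate of Theorem 1 creeps towards the parametric rate without ever reaching it, the exponent $2k/(2k+1)$ can be tabulated for a few kernel orders. This is only an illustrative sketch; the helper name `mise_rate_exponent` is ours and not part of the paper.

```python
from fractions import Fraction

def mise_rate_exponent(k):
    """Exponent r such that the optimal MISE is of exact order n^(-r)
    for a kernel of finite order k (Theorem 1): r = 2k / (2k + 1)."""
    return Fraction(2 * k, 2 * k + 1)

for k in (2, 4, 6, 8, 20):
    r = mise_rate_exponent(k)
    print(f"order k={k:2d}: MISE ~ n^(-{r}) = n^(-{float(r):.4f})")
```

The exponent stays strictly below 1 for every finite $k$, which is why a different class of kernels is needed to attain the rate $n^{-1}$ exactly.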

In view of Theorem 1, one is tempted to conjecture that an infinite order 
kernel is a good candidate to be a superkernel; however, we will see below 
that an infinite order kernel does not need to be a superkernel. 

Denote by $\varphi_K(t)$ the characteristic function of a kernel $K$ and

$$S_K = \inf\{t > 0 : |\varphi_K(t) - 1| \ne 0\},$$
$$T_K = \inf\{r > 0 : |\varphi_K(t) - 1| \ne 0 \text{ a.e. for } t > r\}.$$

That is, $S_K$ is the greatest value of $r$ such that $\varphi_K$ is identically equal to 1
on $[0, r]$ and $T_K$ is the greatest value of $t$ such that $\varphi_K(t) = 1$. Notice that
nearly every kernel used in practice satisfies $S_K = T_K$.

The next result gives a characterization of the class of superkernels, in 
terms of their characteristic functions. 

Theorem 2. Let $K$ be a kernel in $L_2$ such that $S_K = T_K$. The following
statements are equivalent:

i) $S_K > 0$.

ii) $\Phi(n, f, K)$ is of exact order $n^{-1}$ for some density $f \in L_2$.

The previous theorem allows us to give an alternative (and equivalent)
definition of a superkernel: we will say that a kernel $K$ with $S_K = T_K$ is
a superkernel if $S_K > 0$; that is, if its characteristic function is identically
equal to 1 in a neighborhood of the origin. This is just the classical definition
of superkernel used in Devroye (1992) or in Glad, Hjort and Ushakov (2003),
for instance. Thus, although this definition is not very intuitive, Theorem
2 allows us to conclude that it is just the one that we were looking for.
Moreover, from this characterization it follows that Natterer's kernel, which
has infinite order, is not a superkernel; that is, the minimal MISE that we
obtain using Natterer's kernel cannot decrease to zero at rate $n^{-1}$ for any
density. A classical example of a superkernel is given by the trapezoidal kernel
$K(x) = (\cos x - \cos(2x))/(\pi x^2)$, which has characteristic function
$\varphi_K(t) = I_{[0,1)}(|t|) + (2 - |t|) I_{[1,2)}(|t|)$ (with $I_A$ the indicator function
of the set $A$), so that $S_K = T_K = 1$; see Devroye and Lugosi
(2001). Some more examples of superkernels are included in Section 3 of
McMurry and Politis (2004), where they are called infinite order flat-top
kernels.
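
The flat-top shape of the trapezoidal kernel's characteristic function can be checked numerically. The sketch below is our own illustration, not the authors' code: the function names, the truncation half-width `400.0` and the grid step are arbitrary assumptions, and the value at $x = 0$ is filled in by continuity using a Taylor expansion.

```python
import math

def trapezoidal_kernel(x):
    """K(x) = (cos x - cos 2x) / (pi x^2), extended by continuity at x = 0."""
    if abs(x) < 1e-4:
        # Taylor expansion: cos x - cos 2x = (3/2) x^2 - (5/8) x^4 + ...
        return (1.5 - 0.625 * x * x) / math.pi
    return (math.cos(x) - math.cos(2.0 * x)) / (math.pi * x * x)

def char_function(t, half_width=400.0, step=0.01):
    """Approximate phi_K(t) = int K(x) cos(t x) dx by the trapezoidal rule.
    K is even, so the imaginary part vanishes and [0, L] is integrated twice."""
    n = int(round(half_width / step))
    total = 0.5 * (trapezoidal_kernel(0.0)
                   + trapezoidal_kernel(half_width) * math.cos(t * half_width))
    for i in range(1, n):
        x = i * step
        total += trapezoidal_kernel(x) * math.cos(t * x)
    return 2.0 * step * total

def phi_trapezoid(t):
    """Closed form stated in the text: 1 on [0,1), 2 - |t| on [1,2), 0 elsewhere."""
    a = abs(t)
    return 1.0 if a < 1.0 else max(2.0 - a, 0.0)

for t in (0.0, 0.5, 1.5, 2.5):
    print(t, round(char_function(t), 3), phi_trapezoid(t))
```

The numerical and closed-form values should agree up to the truncation error of the integral, which confirms that the characteristic function equals 1 near the origin, i.e. $S_K > 0$.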

The characterization of the class of densities for which the rate $n^{-1}$ is
attainable is given in a paper by Davis (1977). Let us denote by $\varphi_f(t)$ the
characteristic function of a density $f$ and

$$C_f = \sup\{r > 0 : \varphi_f(t) \ne 0 \text{ a.e. for } t \in [0, r]\},$$
$$D_f = \sup\{t > 0 : \varphi_f(t) \ne 0\}.$$

Notice that the support of $\varphi_f$ is contained in $[-D_f, D_f]$; moreover, this
interval coincides with the support in the common case where $C_f = D_f$.

Theorem 3 (Davis, 1977). Let $f$ be a density in $L_2$. The following
statements are equivalent:

i) $D_f < \infty$; i.e., $\varphi_f$ has bounded support.

ii) $\Phi(n, f, K)$ is of exact order $n^{-1}$ for some kernel $K \in L_2$.

Davis' theorem states that in kernel density estimation the MISE may
decrease to zero at rate $n^{-1}$ only if the characteristic function of the density
we aim to estimate has bounded support. An example of this kind of density
is given by the Fejér-de la Vallée-Poussin density, $f(x) = (1 - \cos x)/(\pi x^2)$,
which has characteristic function $\varphi_f(t) = (1 - |t|) I_{[0,1)}(|t|)$. Davis (1977) even
provides a kernel estimator that achieves the parametric rate if the bandwidth
is properly chosen (see also Ibragimov and Khasminskii, 1982); however, her
estimator is based on the sinc function $S(x) = (\sin x)/(\pi x)$, which is not a
kernel as it is not an integrable function. In contrast, our Theorem 2 is valid
for true kernel functions and gives a condition that is not only sufficient but
also necessary for kernel density estimation at a parametric rate.

We can combine Theorems 2 and 3 to get:

Theorem 4. Let $K$ be a kernel with $S_K = T_K$ and $f$ a density, both in $L_2$.
Then,

$\Phi(n, f, K)$ is of exact order $n^{-1}$ iff $S_K > 0$ and $D_f < \infty$.

The theorem above gives a precise characterization of the only case where
kernel density estimation at a parametric rate is possible. Then, we may
wonder what would happen if we use a superkernel when the density does
not fulfil the condition $D_f < \infty$, i.e., when kernel density estimation at a
parametric rate is not possible. In the $L_1$ context, Devroye (1992) showed
that superkernel estimators are rate-adaptive, in the sense that they achieve
the best possible rate that the density permits. Below we show that this is
also the case in the $L_2$ setup.

Theorem 5. Let $K$ be a superkernel and $f$ be a density, both in $L_2$. Then
the following hold:

i) (Smooth case) If $f$ has a $k$-th derivative in $L_1 \cap L_2$, then $\Phi(n, f, K)$
goes to zero as $n^{-2k/(2k+1)}$ or faster; that is, the sequence

$$n^{2k/(2k+1)}\, \Phi(n, f, K)$$

is bounded.

ii) (Supersmooth case) If for some $\alpha > 0$ and $\gamma > 0$ the integral

$$I_{\alpha,\gamma}(f) = \int e^{\gamma |t|^{\alpha}} |\varphi_f(t)|^2\,dt$$

is finite, then $\Phi(n, f, K)$ goes to zero as $(\log n)^{1/\alpha}/n$ or faster; that is,
the sequence

$$n (\log n)^{-1/\alpha}\, \Phi(n, f, K)$$

is bounded.

Remark 1. We have borrowed the terminology "smooth" and "supersmooth"
case from Glad, Hjort and Ushakov (1999), where a result similar to our
Theorem 5 is shown for the sinc kernel; see also Davis (1977). Notice that when
$D_f < \infty$ we are in the supersmooth case for all $\alpha > 0$. Some examples
of densities with $I_{\alpha,\gamma}(f) < \infty$ include the standard Gaussian ($\alpha = 2$) and
Cauchy ($\alpha = 1$) densities. Also, it should be remarked that Theorem 3.1
in Politis (2003) is the analogue to the previous result in a pointwise sense
(rather than for the MISE criterion).

Remark 2. Denote $R(g) = \int g(x)^2\,dx$ for any $g \in L_2$. From the proof of
Theorem 5 (see Section 4 below), in the smooth case the quantity
$n^{2k/(2k+1)}\Phi(n, f, K)$ can be bounded by

$$(2k+1)(2k)^{-2k/(2k+1)}\, R(f^{(k)})^{1/(2k+1)} \left(\frac{R(K)}{S_K}\right)^{2k/(2k+1)}.$$

For all $k$, this bound depends on the superkernel $K$ only through $R(K)/S_K$;
therefore, we could try to find the superkernel $K$ minimizing this value, as
is done in the finite-order case. For kernels of order 2, it is well known that
the kernel minimizing an asymptotic version of the MISE is the so-called
Epanechnikov kernel; see, e.g., Silverman (1986). Here, in the superkernel
case, we have $R(K) \ge S_K/\pi$ for all $K$. This lower bound is achievable if
and only if $\varphi_K(t) = 0$ for all $|t| > S_K$ but clearly, among all the superkernels
satisfying such a condition, the only one fulfilling $R(K) = S_K/\pi$ is given by
$\varphi_K(t) = I_{[-S_K, S_K]}(t)$, which corresponds to (a rescaled version of) the sinc
kernel. In this sense, although the sinc function does not provide a proper
kernel, it is the asymptotically optimal choice; that is, the analogue to the
Epanechnikov kernel for the superkernel case.

Although Theorem 4 seems to be of purely theoretical interest, as it says
nothing about the main problem in kernel density estimation, the choice of
the bandwidth, this issue may be solved by using the next result, which can
be found in Chacón et al. (2006). Let us recall the notation $h_{0n}(f)$ for the
$L_2$-optimal bandwidth; that is,

$$h_{0n}(f) = \operatorname{argmin}_{h>0} \mathrm{MISE}_f(f_{n,K,h}).$$

Theorem 6. Let $K$ be a kernel and $f$ a density, both in $L_2$. If $S_K = T_K$ or
$C_f = D_f$ then

$$h_{0n}(f) \to S_K/D_f \quad \text{as } n \to \infty.$$

Moreover, if $S_K > 0$ and $D_f < \infty$ then, for any fixed $h_* \in (0, S_K/D_f]$ (not
depending on $n$), we have

$$E_f[f_{n,K,h_*}(x)] = f(x), \quad \text{for a.e. } x \in \mathbb{R},\ \forall n \in \mathbb{N},$$

so that $\mathrm{MISE}_f(f_{n,K,h_*})$ is of exact order $n^{-1}$.

Remark 3. Theorem 6 suggests taking $h = S_K/D_f$ under the conditions of
Theorem 4. This is an asymptotic selection, as it is the limit of the optimal
bandwidth sequence but, also, in this case it provides us with an unbiased
kernel density estimator, whose MISE goes to zero at a parametric rate.
Indeed, in such a situation we can bound

$$n\,\Phi(n, f, K) \le n\,\mathrm{MISE}_f(f_{n,K,S_K/D_f}) \le D_f R(K)/S_K,$$

so that the same argument as in Remark 2 shows that the sinc kernel is also the
asymptotically optimal choice for the case where $D_f < \infty$.
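
As a worked instance of this bound, take the Fejér-de la Vallée-Poussin density, whose characteristic function $1 - |t|$ on $[-1, 1]$ gives $D_f = 1$, together with the trapezoidal kernel, for which $S_K = 1$, so that $h = S_K/D_f = 1$. The computation below is our own illustration rather than part of the original text; it assumes Parseval's identity in the form $R(g) = (2\pi)^{-1}\int |\varphi_g|^2$ and the standard identity $\int \mathrm{Var}_f[f_{n,K,h}(x)]\,dx = R(K)/(nh) - R(K_h * f)/n$, and it uses the fact that the estimator is unbiased at this bandwidth, so that the MISE reduces to the integrated variance.

```latex
\begin{align*}
R(K) &= \frac{1}{2\pi}\int |\varphi_K(t)|^2\,dt
      = \frac{1}{2\pi}\Bigl(2 + 2\int_1^2 (2-t)^2\,dt\Bigr) = \frac{4}{3\pi},\\
R(K_1 * f) &= \frac{1}{2\pi}\int |\varphi_f(t)|^2\,|\varphi_K(t)|^2\,dt
      = \frac{1}{2\pi}\cdot 2\int_0^1 (1-t)^2\,dt = \frac{1}{3\pi},\\
\mathrm{MISE}_f(f_{n,K,1}) &= \frac{R(K) - R(K_1 * f)}{n} = \frac{1}{\pi n}
      \;\le\; \frac{D_f\,R(K)}{S_K\,n} = \frac{4}{3\pi n}.
\end{align*}
```

Under these assumptions the MISE at the zero-bias bandwidth is exactly $1/(\pi n) \approx 0.318/n$, which sits below the general bound $D_f R(K)/(S_K n)$ as expected.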

Remark 4. Any bandwidth $h_*$ as in the previous theorem may be called a
global "zero-bias bandwidth". In a similar way, Sain and Scott (2002) show,
for non-negative kernels, the existence of local zero-bias bandwidths $h_0(x)$,
not varying with $n$, for every $x$ in the region where $f$ is convex. Using these
local bandwidths they also get an $n^{-1}$ rate, but with respect to the pointwise
mean squared error.

3 A numerical illustration 

Next we give a simple numerical example showing the performance of the 
superkernel estimators "at full power" , that is, in the optimal situation where 
the characteristic function of the density has bounded support. To do so, we 
are going to focus on the aforementioned Fejér-de la Vallée-Poussin density

$$f(x) = \frac{1 - \cos x}{\pi x^2}, \quad x \in \mathbb{R},$$

and the trapezoidal superkernel

$$K(x) = \frac{\cos x - \cos(2x)}{\pi x^2}, \quad x \in \mathbb{R}.$$

For this superkernel, we will use two different bandwidth selection
approaches: the first bandwidth is selected by a cross-validation method (see
Silverman, 1986, or Wand and Jones, 1995); the second bandwidth comes
from a version of the bandwidth selection procedure proposed by Politis
(2003). This method aims to estimate $D_f$ making use of the empirical
characteristic function, and it is closely related to the one proposed by Chiu (1991)
for a similar problem in density estimation (see also Politis and Romano,
1999). If $\varphi_n(t) = n^{-1} \sum_{j=1}^{n} \exp\{itX_j\}$ denotes the empirical characteristic
function, $D_f$ is estimated by

$$\hat{D}_n = \inf\{D > 0 : |\varphi_n(D + t)|^2 < c\,[\ldots],\ \forall t \in (0, \ell_n)\},$$

where $c > 0$ is a fixed constant and $(\ell_n)$ is a positive nondecreasing sequence.
As suggested in Remark 3, the chosen bandwidth is then $h_n = S_K/\hat{D}_n$.
Following the advice in Politis (2003), in all the simulations we have taken $c = 1$
and $\ell_n = 1$.
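
The idea behind this selector, scanning the empirical characteristic function for the first point after which it stays negligible over a window, can be sketched as follows. This is a hedged illustration, not the procedure as implemented by the authors: the function names, the grid discretization (`step`, `d_max`) and the closed right end of the window are our assumptions, and the cut-off level is passed in as an explicit `threshold` argument because its exact functional form is not recoverable from the text.

```python
import math

def ecf_sq_modulus(sample, t):
    """|phi_n(t)|^2, where phi_n(t) = n^{-1} sum_j exp(i t X_j)."""
    n = len(sample)
    re = sum(math.cos(t * x) for x in sample) / n
    im = sum(math.sin(t * x) for x in sample) / n
    return re * re + im * im

def estimate_df(sample, threshold, ell=1.0, step=0.01, d_max=20.0):
    """Smallest grid value D such that |phi_n(D + t)|^2 < threshold for every
    grid point t in (0, ell]; returns None if no D <= d_max qualifies.
    `threshold` is a user-supplied cut-off (assumption, see lead-in)."""
    m = int(round(ell / step))            # grid points per window
    n_d = int(round(d_max / step)) + 1    # candidates D = 0, step, ..., d_max
    below = [ecf_sq_modulus(sample, step * i) < threshold
             for i in range(n_d + m)]
    run = 0  # length of the current run of consecutive points below threshold
    for i in range(1, n_d + m):
        run = run + 1 if below[i] else 0
        if run >= m:
            return step * (i - m)
    return None

# Deterministic toy input: for the sample {-1, 1}, phi_n(t) = cos(t).
d_hat = estimate_df([-1.0, 1.0], threshold=0.5)
print(d_hat)
if d_hat:
    s_k = 1.0                 # S_K = 1 for the trapezoidal kernel
    print(s_k / d_hat)        # plug-in bandwidth h_n = S_K / D_hat
```

On real data one would pass the simulated sample and a threshold that shrinks with $n$; the pure-Python loops are kept for clarity and would be vectorized in practice.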

We want to compare this superkernel density estimator with the classical
one, using a density function as a kernel. To this aim, we also include in the
simulations the results for the Sheather-Jones method (Sheather and Jones,
1991), which uses the standard normal density as the kernel, so that it is
known that the MISE cannot decrease faster than $n^{-4/5}$ (again, see Theorem
1 above).

We have tried these three methods for sample sizes $n = 100$ (small),
$n = 400$ (medium) and $n = 1600$ (large) over 100 simulated samples of each
size drawn from the Fejér-de la Vallée-Poussin density. The results are shown
in Table 1. For each estimator and sample size we give the average and
standard deviation of the 100 values of $\mathrm{ISE}(f_n) = 10^{3} \times \int (f_n - f)^2$.



   n     ISE_CV          ISE_SJ           ISE_Pol

 100     3.36 (4.38)     3.04 (2.21)      2.53 (2.28)
 400     2.59 (1.11)     0.902 (0.519)    0.612 (0.365)
1600     1.10 (0.811)    0.348 (0.172)    0.179 (0.132)

Table 1: Simulation results for sample sizes n = 100,400,1600. Averages 
and (standard deviations) of the ISE are given for each method. 

As usual, it can be seen from Table 1 that the cross-validated selector is
far more variable than the others. In this case, even its average ISE is
unacceptably large when it is used together with a superkernel. In contrast,
the selector of Politis does a good job: it is comparable with the
Sheather-Jones method for small sample size, but the superior asymptotics of the
superkernel estimator clearly begin to show already for $n = 400$.
For large sample size, the better performance of the superkernel estimator
is even more evident, obtaining nearly half the average ISE of the
Sheather-Jones selector and less variance. Therefore, the usefulness of superkernels in
density estimation becomes clear, at least in this case.



4 Proofs 

The proof of our main result (Theorem 2) relies heavily on previous results
that may be found in Chacón et al. (2006). For the sake of completeness we
also include their statements here.

Lemma 1. Let $f$ be a density and $K$ a kernel, both in $L_2$. Then:

i) $R(K_h * f) \to R(f)$ as $h \to 0$.

ii) If $S_K = 0$ then $h_{0n}(f) \to 0$ as $n \to \infty$.

For the proof of Theorem 2 we will need an auxiliary result. It states that
if we use a kernel $K$ with $S_K = 0$, then the MISE-convergence rate is slower
than $n^{-1}$ for every density. It can be applied, for instance, to finite-order
kernels, as it is easy to show that any kernel of finite order satisfies $S_K = 0$.

Lemma 2. If $K \in L_2$ is a kernel such that $S_K = 0$ then, for every density
$f \in L_2$, we have that

$$\lim_{n\to\infty} n\,\Phi(n, f, K) = \infty.$$

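Proof. It is easy to show that

$$\int \mathrm{Var}_f[f_{n,K,h}(x)]\,dx = R(K)/(nh) - R(K_h * f)/n,$$

where $*$ stands for convolution (see Wand and Jones, 1995). Therefore,

$$n\,\Phi(n, f, K) = n\,\mathrm{MISE}_f(f_{n,K,h_{0n}(f)}) \ge n \int \mathrm{Var}_f[f_{n,K,h_{0n}(f)}(x)]\,dx = R(K)/h_{0n}(f) - R(K_{h_{0n}(f)} * f).$$

Then, the conclusion follows immediately from Lemma 1, since $h_{0n}(f) \to 0$
makes the first term diverge while the second one converges to $R(f)$. □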

Proof of Theorem 2. If $S_K > 0$, then Theorem 6 states that it suffices to
consider a density with $D_f < \infty$, such as the Fejér-de la Vallée-Poussin
density, to get a parametric MISE-convergence rate. On the other hand, the
previous lemma shows precisely the implication ii) $\Rightarrow$ i). □

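Proof of Theorem 5. In the smooth case, standard Fourier transform theory
shows that the conditions on $f$ ensure that

$$\int |t|^{2k} |\varphi_f(t)|^2\,dt = 2\pi R(f^{(k)}) < \infty.$$

Using Parseval's identity, $2\pi\,\mathrm{MISE}_f(f_{n,K,h}) = B(h) + V(h)$, where

$$0 \le B(h) = \int |\varphi_f(t)|^2\,|\varphi_K(th) - 1|^2\,dt,$$
$$0 \le V(h) = \frac{1}{nh} \int |\varphi_K(t)|^2\,dt - \frac{1}{n} \int |\varphi_f(t)|^2\,|\varphi_K(th)|^2\,dt.$$

Then, we can bound $V(h)$ by $\int |\varphi_K|^2/(nh)$ and

$$B(h) = \int_{|t| > S_K/h} |\varphi_f(t)|^2\,|\varphi_K(th) - 1|^2\,dt
\le \int_{|t| > S_K/h} |\varphi_f(t)|^2\,dt
\le \frac{h^{2k}}{S_K^{2k}} \int |t|^{2k} |\varphi_f(t)|^2\,dt,$$

so that

$$\mathrm{MISE}_f(f_{n,K,h}) \le \frac{h^{2k}}{S_K^{2k}}\, R(f^{(k)}) + \frac{R(K)}{nh}.$$

Calculating the minimum of the expression on the right-hand side of the
previous display, we get

$$\Phi(n, f, K) \le (2k+1)(2k)^{-2k/(2k+1)}\, R(f^{(k)})^{1/(2k+1)} \left(\frac{R(K)}{S_K}\right)^{2k/(2k+1)} n^{-2k/(2k+1)},$$

as desired.

For the supersmooth case, the same kind of calculations can be used to
bound

$$B(h) \le e^{-\gamma S_K^{\alpha}/h^{\alpha}}\, I_{\alpha,\gamma}(f).$$

Now, taking $h$ to be of order $(\log n)^{-1/\alpha}$ in $B(h) + V(h)$ gives the proof. □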
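
The minimization step in the smooth case can be spelled out. The derivation below is our own expansion, assuming the MISE bound has the form $a h^{2k} + b/h$ with the shorthand $a = R(f^{(k)})/S_K^{2k}$ and $b = R(K)/n$ (the symbols $a$, $b$, $g$ and $h_{\min}$ are introduced here only for illustration).

```latex
\begin{align*}
g(h) &= a h^{2k} + \frac{b}{h}, \qquad
g'(h) = 2k\,a\,h^{2k-1} - \frac{b}{h^2} = 0
\;\Longrightarrow\; h_{\min}^{2k+1} = \frac{b}{2k\,a},\\
g(h_{\min}) &= \frac{b}{h_{\min}}\Bigl(\frac{1}{2k} + 1\Bigr)
= \frac{2k+1}{2k}\; b \Bigl(\frac{2k\,a}{b}\Bigr)^{1/(2k+1)}
= (2k+1)(2k)^{-2k/(2k+1)}\, a^{1/(2k+1)}\, b^{2k/(2k+1)}.
\end{align*}
```

Substituting $a$ and $b$ back gives a factor $R(f^{(k)})^{1/(2k+1)} (R(K)/S_K)^{2k/(2k+1)} n^{-2k/(2k+1)}$, which is the rate claimed in the smooth case and shows that the kernel enters the constant only through $R(K)/S_K$.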